Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(job-distributor): add exp. backoff retry to feeds.SyncNodeInfo() #15752

Merged

Conversation

gustavogama-cll
Copy link
Contributor

@gustavogama-cll gustavogama-cll commented Dec 18, 2024

There’s a behavior that we’ve observed for some time on the NOP side where they will add/update a chain configuration of the Job Distributor panel but the change is not reflected on the service itself. This leads to inefficiencies as NOPs are unaware of this and thus need to be notified so that they may "reapply" the configuration.

After some investigation, we suspect that this is due to connectivity issues between the nodes and the job distributor instance, which causes the message with the update to be lost.

This PR attempts to solve this by adding a "retry" wrapper on top of the existing SyncNodeInfo method. We rely on avast/retry-go to implement the bulk of the retry logic. It's configured with a minimal delay of 10 seconds, maximum delay of 30 minutes and retry a total of 56 times -- which adds up to a bit more than 24 hours.

DPA-1371

Copy link
Contributor

github-actions bot commented Dec 18, 2024

AER Report: CI Core ran successfully ✅

aer_workflow , commit

AER Report: Operator UI CI ran successfully ✅

aer_workflow , commit

@gustavogama-cll gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch 3 times, most recently from 301972c to 83b1842 Compare December 18, 2024 05:51
@graham-chainlink
Copy link
Collaborator

graham-chainlink commented Dec 19, 2024

Hmm i wonder would this solve the connection issue?

If there is communication issue between node and JD, how would the auto sync help resolve it? It will try and it will fail right?

Alternatively would it be better to have some kind of exponential backoff retry when it does fail during the sync instead? (not that it will solve a permanent connection issue)

@gustavogama-cll gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch 2 times, most recently from 57b55cc to c5d0079 Compare December 20, 2024 04:34
@gustavogama-cll gustavogama-cll changed the title feat(job-distributor): periodically sync node info with job distributors feat(job-distributor): add exp. backoff retry to feeds.SyncNodeInfo() Dec 20, 2024
@gustavogama-cll gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch 2 times, most recently from a1a4281 to 61297ab Compare December 20, 2024 05:17
@gustavogama-cll
Copy link
Contributor Author

Alternatively would it be better to have some kind of exponential backoff retry when it does fail during the sync instead? (not that it will solve a permanent connection issue)

As discussed earlier today, I went ahead and implemented your suggestion. I ran a few manual tests and it seems to work as expected, though I had to add some extra logic around the context instances to get there.

I still feel the background goroutine would be more resilient. But, on the other hand, this option does not require any runtime configuration -- I think we can safely hardcode the retry parameters -- which is a huge plus to me.

@graham-chainlink
Copy link
Collaborator

I still feel the background goroutine would be more resilient. But, on the other hand, this option does not require any runtime configuration -- I think we can safely hardcode the retry parameters -- which is a huge plus to me.

Thanks @gustavogama-cll, yeah the background go-routine definitely has its pros, both approaches are valid, just that for me i think the retry is simpler.

@gustavogama-cll gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch from 61297ab to 5c30694 Compare December 27, 2024 02:32
@gustavogama-cll gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch 6 times, most recently from e540377 to b2f8386 Compare January 15, 2025 03:35
@gustavogama-cll gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch from b2f8386 to 4ff90cc Compare February 19, 2025 22:16
Copy link
Contributor

Flakeguard Summary

Ran new or updated tests between develop and 4ff90cc (dpa-1371-feat-periodic-sync-node-info-job-distributor).

View Flaky Detector Details | Compare Changes

Found Flaky Tests ❌

2 Results
Name Pass Ratio Panicked? Timed Out? Race? Runs Successes Failures Skips Package Package Panicked? Avg Duration Code Owners
Test_Service_syncNodeInfoWithRetry 0% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/feeds false 0s @smartcontractkit/deployment-automation, @smartcontractkit/core
Test_Service_syncNodeInfoWithRetry/more_errors_than_MaxAttempts 0% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/feeds false 1.016666666s @smartcontractkit/deployment-automation, @smartcontractkit/core

Artifacts

For detailed logs of the failed tests, please refer to the artifact failed-test-results-with-logs.json.

@gustavogama-cll gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch from 4ff90cc to cd86fbb Compare February 19, 2025 22:53
@gustavogama-cll gustavogama-cll marked this pull request as ready for review February 20, 2025 03:05
@gustavogama-cll gustavogama-cll requested a review from a team as a code owner February 20, 2025 03:05
@gustavogama-cll gustavogama-cll requested review from a team as code owners February 20, 2025 03:05
@gustavogama-cll gustavogama-cll marked this pull request as draft February 20, 2025 22:34
@gustavogama-cll gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch from cd86fbb to 8fa7efc Compare February 20, 2025 22:35
Copy link
Contributor

Flakeguard Summary

Ran new or updated tests between develop and 8fa7efc (dpa-1371-feat-periodic-sync-node-info-job-distributor).

View Flaky Detector Details | Compare Changes

Found Flaky Tests ❌

2 Results
Name Pass Ratio Panicked? Timed Out? Race? Runs Successes Failures Skips Package Package Panicked? Avg Duration Code Owners
Test_Service_syncNodeInfoWithRetry 0% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/feeds false 0s @smartcontractkit/deployment-automation, @smartcontractkit/core
Test_Service_syncNodeInfoWithRetry/more_errors_than_MaxAttempts 0% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/feeds false 1.016666666s @smartcontractkit/deployment-automation, @smartcontractkit/core

Artifacts

For detailed logs of the failed tests, please refer to the artifact failed-test-results-with-logs.json.

@gustavogama-cll gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch from 8fa7efc to 6abb8aa Compare February 21, 2025 05:02
@gustavogama-cll gustavogama-cll marked this pull request as ready for review February 21, 2025 05:32
Copy link
Collaborator

@graham-chainlink graham-chainlink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

There’s a behavior that we’ve observed for some time on the NOP side
where they will add/update a chain configuration of the Job
Distributor panel but the change is not reflected on the service itself.
This leads to inefficiencies as NOPs are unaware of this and thus need
to be notified so that they may "reapply" the configuration.

After some investigation, we suspect that this is due to connectivity
issues between the nodes and the job distributor instance, which causes
the message with the update to be lost.

This PR attempts to solve this by adding a "retry" wrapper on top of the
existing `SyncNodeInfo` method. We rely on `avast/retry-go` to implement
the bulk of the retry logic. It's configured with a minimal delay of
10 seconds, maximum delay of 30 minutes and retry a total of 56 times --
which adds up to a bit more than 24 hours.

Ticket Number: DPA-1371
@gustavogama-cll gustavogama-cll force-pushed the dpa-1371-feat-periodic-sync-node-info-job-distributor branch from 6abb8aa to a42860e Compare February 24, 2025 19:50
@gustavogama-cll gustavogama-cll added this pull request to the merge queue Feb 26, 2025
Merged via the queue into develop with commit 39d0909 Feb 26, 2025
167 checks passed
@gustavogama-cll gustavogama-cll deleted the dpa-1371-feat-periodic-sync-node-info-job-distributor branch February 26, 2025 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants